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<jj Abstract 

We introduce a new algorithm for finding the eigenvalues and eigenvectors 
of Hermitian matrices within a specified region, based upon the LANSO al- 
gorithm of Parlett and Scott. It uses selective reorthogonalization to avoid 
the duplication of eigenpairs in finite-precision arithmetic, but uses a new 
bound to decide when such reorthogonalization is required, and only re- 
orthogonalizes with respect to eigenpairs within the region of interest. We 
investigate its performance for the Hermitian Wilson-Dirac operator 75D in 
lattice quantum chromodynamics, and compare it with previous methods. 
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1. Introduction 

1.1. Motivation 

The problem of computing part of the spectrum of a large Hermitian matrix 
is common to many areas of computational science, but the particular appli- 
cation that motivated this work is the computation of the Neuberger operator 
for lattice QCD (Quantum Chromodynamics being the quantum field theory 
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of the strong nuclear force). This requires us to evaluate the signum function 
of the "Hermitian Dirac operator" j 5 D corresponding to some discrete lattice 
Dirac operator D, which is defined by diagonalizing this matrix and taking 
the signum (±1) of each of its eigenvalues. It is far too expensive to carry 
out the full diagonalization, so we use a Zolotarev rational approximation 
for the signum function as this can be evaluated just using matrix addition, 
multiplication, and inversion by using a multi-shift solver for its stable par- 
tial fraction expansion jl]. The approximation is expensive for eigenvalues 
of 75 D that are very close to zero, and as there are only a relatively small 
number of these we want to deflate them and take their sign explicitly. For 
this reason we need to compute the part of the spectrum of 75-D around zero. 

1.2. Outline 

We begin surveying some basic properties of symmetric matrices in order to 
introduce the notation used throughout the paper. A pedagogical review of 
simple eigensolver methods then follows, which leads on to the derivation 
of the Lanczos method with an explanation of the problems associated with 
it when using finite-precision floating point arithmetic. An overview of the 
LANSO algorithm of Parlett and Scott |2j is introduced which forms the 
starting point for the work described here. The goal of the algorithm we 
introduce in this paper is not to find the full spectrum of a large Hermitian 
matrix, but to find that part of the spectrum lying within some specified 
range. For the application described in §|5] its implementation in Chroma || 
performs significantly better than the state-of-the-art Ritz method ||, [5| . 

2. Hermitian Matrices and the Power Method 

2.1. Basic Properties of Symmetric Matrices 

A matrix A is Hermitian (with respect to a sesquilinear inner product) if A = 
A^, which means (u, Av) = (A^u, v) = (Au, v) = (v,Au)*, or equivalently 
■ Av = (Atu)t • v = (Au)''" • v = (v^ ■ Au)*. An eigenvalue A of A satisfies 
Az = Az where z 7^ is the corresponding eigenvector. The eigenvalues 
are real and the eigenvectors are orthogonal. Any matrix can be reduced 
to triangular form T by a unitary (orthogonal) transformation^ (change of 



3 This is Schur normal form, which follows from the Cayley-Hamilton theorem that 
every matrix satisfies its characteristic equation, and the fundamental theorem of algebra 
which states that the characteristic polynomial p(X) — det(A — A) has exactly N — dim(A) 
complex roots, counting multiplicity. 
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basis), A = QTQ" 1 = QTQt. For A Hermitian = (Q t AQ) t = Q f A f Q = 
Q^AQ = T it follows that T is real and diagonal; thus AQ = QT so the 
columns of Q furnish the orthonormal eigenvectors. 

2.2. Power Method 

In order to find eigenvalues and eigenvectors numerically one obvious ap- 
proach is the Power Method. An arbitrary starting vector can, in theory, be 
expanded in the orthonormal eigenvector basis {zj}, u = £)■ z j( z j, u o)- The 
matrix A is applied to u and the result normalized to get ui, and so forth: 
u fc+ i = Aufc/|| Aujfe|| , where the norm is ||x|| = ^ (x, x). We then find that 
u fc oc Ajzi(zi,Uo) + Ej>i(Aj/Ai) fc z j (z j ,u ), and as lim fc ^ 00 (A i /Ai) fc = we 
find limfc^oo Ufc = z x assuming (zi,uq) 7^ 0, where we label the eigenpairs 
such that I Ai| > | A2 1 > • • ■ > \Xn\- If the eigenvalue Ai is degenerate then 
converges to the eigenvector parallel to Uo- The rate of convergence is gov- 
erned by |A 2 /Ai| fe = e - fc ( ln l Al l- ln l A2 D. If we shift the matrix A by a constant 
then we just shift its eigenvalues by the same constant and leave the eigen- 
vectors unchanged; however, such a shift does change the rate of convergence 
of the power method. 

3. Krylov Spaces and Lanczos Algorithm 

3.1. Krylov Spaces 

We consider a sequence of subspaces of increasing dimension n such that the 
restriction of A to them converges to A as n — > 00. For an N x N matrix 
A, convergence will always occur because the approximations equal A for 
n > N. In many cases of practical interest the matrix approximates some 
compact linear operator on an oo-dimensional Hilbert space, and we expect 
the convergence to be governed by the properties of the underlying operator. 

In practice we usually do not have an explicit matrix representation of the 
large (sparse) matrix A, but we merely have some functional "black box" 
representation that allows us to apply it to a vector in ]R. N . Almost the 
only spaces we can construct from this are the Krylov spaces /C n (A, u) = 
span(u, Au, A 2 u, . . . , A n_1 u) where u is some more-or-less arbitrary starting 
vector. The only simple generalization is block Krylov spaces where we start 
from more than one vector. 
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3.2. Arnoldi Method 

The vectors {A J u} do not form an orthonormal basis for the Krylov space. 
Furthermore, the corresponding unit vectors A- 7 u/||A J u|| converge to the 
largest eigenvector of A, as they are just successive iterates of the power 
method. They therefore provide a particularly bad choice of basis for numer- 
ical computations. It is natural to construct a better orthonormal basis by 
deflation and normalization, 

j u 

q x = u/||u||, u j+ i = Aq,- - 53q*(q*, Aq,-), Qj+i = n J+1 M ; 

*=i ll u i+ill 

in other words the Gram-Schmidt procedure. This is called the Arnoldi 
method. We see immediately that (cjj+i, Aq,) = (q^+i, Uj+i) = ||u.j+i||. The 
n x n matrix Q whose columns areQ Qe., = q^ therefore furnishes an orthog- 
onal projector QQ^ = Y^]=i q? ® qj onto /C n (A, u). 

The restriction of A to the Krylov space is Hessenberg by construction: 

#1 ,71 — 1 

#2,n-l #2,n 
#3,n-l #3,71 

#71-1,71-2 #71.-1,71-1 #71-1,71 

\ • • • #71,71-1 #71,71 / 

We can diagonalize this matrix using the QR algorithm || to obtain = 
S^HS, where 6 is the diagonal matrix of Ritz values, ®ij = 9j5y, and S the 
nxn unitary (orthogonal) matrix whose columns are the corresponding Ritz 
vectors Sj = Sej. We may hope that some of the Ritz values approximate the 
eigenvalues of A, 9j ~ Xj>, and that some of the Ritz vectors approximate its 
eigenvectors, QSe^ = Qs^ = fa Zj>, provided that the residual R = AQ— QH 
is small, since A(QS) = (QH + R)S = (QS)6 + 0(||R||). 

3.3. Ldnczos Algorithm 

We are interested the special case of the Arnoldi method for a Hermitian 
matrix A, which means that the matrix H is also Hermitian, = (Q^AQ)^ = 



4 ej is a basis vector whose components are [ej]i = 5, 



H = Q f AQ 
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QtA^Q = H. A matrix which is both Hessenberg and Hermitian is tridiagonal 



H = QtAQ = 
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where = ||uj + i|| = (qj+i, Aqj) and «j = (q^, Aq^) are real. 
We thus have a three-term recurrence relation 

Ac b- = pjqj+i + "jCb- + &-iqj-i; (!) 

this defines the Ldnczos algorithm. This greatly simplifies the computation; 
not only is it easier to diagonalize a tridiagonal matrix using the QR algo- 
rithm, but also means that Aq., is automatically (implicitly) orthogonal to 
all qj except for qj_i, q«, and q«+i. Unfortunately, floating-point arithmetic 
does not respect implicit orthogonality. 

3.4- Loss of orthogonality among the Ldnczos vectors 

As noted in the previous section, with a basic implementation of the Lanczos 
algorithm, orthogonality amongst the Lanczos vectors is lost due to round- 
ing errors. The most obvious indication of this loss of orthogonality is the 
appearance of spurious copies of eigenvectors. It is interesting to store the 
Lanczos vectors to measure the loss of orthogonality directly, as it allows us 
to see where the loss of orthogonality occurs. The results £1X6 clS expected: 
with the basic Lanczos algorithm (i.e., one with no reorthogonalization) the 
orthogonality of a Lanczos vector with respect to those calculated more than 
two steps previously is only implicit; consequently, as rounding errors in- 
evitably bring back components of the early Lanczos vectors, there is nothing 
to suppress these components. They therefore grow in an unrestrained man- 
ner until eventually orthogonality between the most recent Lanczos vector 
and those calculated early on in the procedure is completely lost. This was 
demonstrated in wheref] log 10 (q*qfc/e) is displayed as a symmetric array 
of numbers with small values representing mutually orthogonal vectors and 
large ones representing pairs of vectors with a large overlap. 



5 The "unit of least precision" , e, is the smallest number such that 1 © e ^ 1 in floating- 
point arithmetic, it is approximately f0~ 7 for single precision and 10 -14 for double preci- 
sion. 
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t 

(q*,qj) = o 



t 

(<b,qj-) = 1 




Figure 1: The orthogonality of the Lanczos vectors without reorthogonal- 
ization. Increasing Lanczos iterations, % and j are shown in the 4- and — > 
directions. 

For larger systems we can view this as a colour map, an example of which 
is shown in Figure [I]. Large values are represented at the red end of the 
spectrum and small values at the blue end. Thus the diagonal is shown 
in red (representing \\qj\\ = 1) and mutually orthogonal vectors are shown 
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in blue. 



We can see clearly that with no reorthogonalization, after sufficient steps 
the new Lanczos vectors lose orthogonality to the early ones. Note that q, 
loses orthogonality with the very earliest Lanczos vectors to a lesser extent 
compared with those which occur after a few steps. This is to be expected as 
the initial random starting vector qx will in general not contain large compo- 
nents of any particular eigenvector. However, after a few steps the Lanczos 
vectors will start to contain large components of the dominant eigenvectors 
according to the argument given in §2.2| for the power method, and it is pre- 
cisely these dominant eigenvectors that will grow from rounding errors and 
so reappear in q^. 



3.5. Degenerate Eigenspaces and Restarting 

In exact arithmetic only one eigenvector will be found for each distinct eigen- 
value: if an eigenvalue is degenerate then this vector will be the projection 
of the initial vector onto its eigenspace. In floating-point arithmetic round- 
ing errors will eventually cause the other eigenvectors to appear; this will 
take longer in higher-precision arithmetic. This may perhaps be viewed as a 
case where using floating-point arithmetic is an advantage. Such degenerate 
eigenvectors can also be found by restarting the Lanczos algorithm with a 
new initial vector and deflating with respect to the previously known good 
eigenvectors. This can be repeated until no more degenerate eigenvectors 
are found. Presumably a block version of the algorithm could be used too, 
but the choice of block size is not obvious if the maximum degeneracy is not 
known a priori. A cluster of nearby eigenvalues behaves just like a degen- 
erate subspace until sufficient accuracy to resolve the eigenvalues has been 
attained. 



4. Selective Reorthogonalization 

We will deem a Ritz vector y 3 - e /C n (A, u), where to be "good" if it lies within 
the Krylov subspace /C„/(A, u) with n' < n, that is if (y^, q&) = (QSj, Qe k ) = 
(sj,ek) ~ for k > n!\ eigenvalues that are not good will be called "bad". 
Paige P| has shown that the loss of implicit orthogonality occurs primarily in 
the direction of good Ritz vectors. This is not surprising: if q n /+i and q„'+2 
are orthogonal to an eigenvector z of A with eigenvalue A then all future 
Lanczos vectors will also be orthogonal to z in exact arithmetic. We may 
prove this by induction: assume (z, q^) = (z, q^+i) = for some k > n', then 

(z, Aq fc+ i) = (Az, q fe+1 ) = A(z, q fc+1 ) = 



7 



= (z, (3 k+ iqk+2 + afc+iqfc+i + 0k<\k) = /3fc+i(z, q/c+2), 

hence (z, qfc+2) = unless the Lanczos process terminates because j3k+i = 
0. Concomitantly, any rounding errors that appear in the computation of 
Qj for j > n' + 2 with a component parallel to z will not be suppressed 
by orthogonalization to the previous two Lanczos vectors; moreover, this 
component will grow as |A/A'| fc where A' is the largest "bad" eigenvalue. 

It therefore suffices to orthogonalize the current Lanczos vectors q n and q n +i 
explicitly with respect to good eigenvectors sufficiently frequently. This is 
much cheaper than explicitly orthogonalizing with respect to all the previous 
Lanczos vectors at each step as in the Arnoldi method. 

4.1. LAN SO 

How often do we need to carry out this reorthogonalization? As round- 
ing errors are of order e it seems reasonable to choose to do so when the 
loss of orthogonality has accumulated to be of 0(y/e). We therefore choose 
to orthogonalize q„/ and q n / + i with respect to a good Ritz vector y when 
(y^qn') > y/s- In their LANSO algorithm Parlett and Scott introduce 
two bounds, 

1. The t bound, > |(yj, qj)\, that is used to trigger reorthogonalization 
with respect to y^. This bound is computed cheaply by a three-term 
scalar recurrence. 

2. The k bound, n > ||QtQ— 1||, that is used to trigger a "pause", namely a 
search for new good eigenvectors by running the QR algorithm, followed 
by a reorthogonalization of the last two Lanczos vectors with respect to 
all good eigenvectors. This is computed by a more complicated scalar 
recurrence. 

4-1.1. Monitoring the k and t bounds 

The success of the LANSO method hinges on the ability of Kj and to 
bound well enough the actual values ||1 — Q*Q|| and (y^, qj) respectively for 
any Lanczos step j and all good eigenvectors y& calculated thus far. For our 
relatively small test cases we can store all the Lanczos vectors which make 
up Q, and all the known good eigenvectors. This enables us to calculate 
the values of ||1 — Q^Q|| and (y/c,q,-) to compare with these bounds. This 
information is plotted in Figure where the r bound is plotted together 
with the value which it is supposed to bound. The points at which the 
k bound triggers a pause are also shown. This figure reveals a number of 
features. Firstly, r k j > |(yfc,qj)| as required. However, the bounds appear 
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Figure 2: The upper figure plots the values of |(yfc, q^) | in red and its bound 
in green for each y*. for several different Ritz values The blue vertical 
dotted lines show the points at which a diagonalization of H is triggered by 
the k bound. The lower figure is similar but here the red vertical lines show 
the a bound being used to trigger a full diagonalization. In this example the 
n bound was used to trigger the very first pause, but this is not needed: er 
could be used from the beginning. 

rather pessimistic: the k bound exceeds the tolerance y/e and triggers a 
pause (recall this entails the calculation of the spectrum of the tridiagonal 
matrix) far more frequently than needed, and the r bound is often many 
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orders of magnitude larger than the quantity it is bounding; however, the 
reorthogonalization triggered by this is relatively inexpensive. Due to the 
frequent triggering by the k bound, in practice it is the k bound and never 
the individual r bounds which triggers reorthogonalization. 

4-2. New Algorithm 

In our application, as in many others, we do not need to find all the eigenpairs: 
it suffices to find those in a pre-specified region E of the spectrum. We 
only need search for eigenvalues in E and selectively reorthogonalize Lanczos 
vectors with respect to them; we are not interested if duplicate eigenvectors 
occur outside E. In passing, we note that it is easy to restrict the QR iteration 
to search in the region by a judicious choice of shifts (see §p.2|). 

Our algorithm replaces both LANSO bounds with a bound a that is a gen- 
eralization of the t bound, a bounds the loss of orthogonality of a Lanczos 
vector <\j with respect to any good Ritz vector y within E, even if y is not 
explicitly known. We shall require that <jj > max fc9fc£S |(yfc,qj)l) where the 
maximum is taken over all good Ritz pairs in E. 

a is calculated via a three term recurrence relation closely related to that for 
the r bound. We consider the propagation and amplification of the lack of 
orthogonality of the good Ritz vectors with current Lanczos vectors and ig- 
nore other inconsequential rounding errors as in ||. Taking the inner product 
of (|1]) with y fc gives 

(y*» Ac b0 - (yfc, qj-O/Vi _ (y*> ~ (yfc' ty+O^- = °- ( 2 ) 

If yk = Qs/c is a good Ritz vector within E, where (9 k , s k ) is a Ritz pair 
(Hs fc = s k 9 k ) then 

(yjfe> Aqj) = (Ay^qj) = (AQs fe ,q,,) = (QHs^q,) + (Rs k , q/) = {y k ,(lj)9 k , 

(3) 

as the residual is orthogonal to the Krylov space, Q^R = Q^AQ — Q^QH = 0. 
From (H) and @ we obtain 

[yk,qj+i)Pj = (yfc,q;)(#fc - - (yfc,q;-i)/Vi- 

We assume by induction that a, > \(y k , q,)| Vfc : 9 k G E, Vz < j, hence 

l(y*» qj+i)l W < l(y*>q/)l % - %l + l(yfc,q?-i)l \Pj-i\ 

< max aj\0 - otj \ + <r j -x\P S - 1 \\ 
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so we may define 



max 10-0,1^ + |/?i-ikj-i 



where the "initial values" o t -\ = 0{e) an d °~t = correspond to the lack 
of orthogonality after selectively orthogonalizing a good Ritz vector using a 
finite precision arithmetic implementation of Gram-Schmidt, t being the last 
iteration at which the algorithm was paused to search for new Ritz pairs. We 
shall not give a detailed analysis of this algorithm here but it is very similar 
for that for LANSO given in [gf. 

We are interested in applying our algorithm to low density interior regions of 
the spectrum. The algorithm is surprisingly effective as we find such interior 
eigenvalues converge rapidly in a manner reminiscent of extremal eigenvalues. 
The reason why eigenvalues in low-density regions are so well represented in 
the Krylov space is explained in ||. 

4-2.1. Results of the new algorithm 

The lower panel of Figure |2] shows the effect of using a to trigger a pause. 
We see immediately that when using the a procedure, diagonalization of H 
is performed far less frequently than is the case when using the k procedure. 



5. Calculating low eigenvalues of the Fermion matrix 

The Lanczos method itself can be used to diagonalize only Hermitian matri- 
ces, but the matrices are not required to be positive definite. The Wilson- 
Dirac fermion matrix D is not Hermitian, but we can exploit the fact that 
our matrix is "75-Hermitian" , 75D75 = D\ where 75 is a product of the four 
Hermitian gamma matrices 75 = 71727374 which satisfy the anticommutation 
relations {7^,7^} = 25^. This allows us to construct the Hermitian matrix 
75D. 

We are interested in the eigenvalues close to gaps in the spectrum, and for 
75 D there is such a gap around zero. These eigenvalues map to extremal 
eigenvalues of D^D = (75D) 2 , but if we use D^D then we have to consider the 
extra work involved in resolving the sign of the corresponding eigenvalues of 
75 D. This also involves dealing with any mixing which takes place due to the 
near degeneracy of the approximate eigenvalues, since eigenvalues A 2 of D^D 
might be mixtures of eigenvalues of 75 D near either ±A. 
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When using the Lanczos method we see eigenvalues of both large and small 
magnitude being resolved first giving two regions in the case of D^D (corre- 
sponding to both large and small eigenvalues), and four regions in the case 
of 75 D (corresponding to large and small eigenvalues both positive and neg- 
ative). Figure § shows a bar chart of the relative number of small and large 
converged eigenvalues (regardless of sign) determined at each pause for both 
75 D and D+D. 
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Small and large eigenvalues of MdagM Lanczos iterations 



smail eigs of MdagM I 
large eigs of MdagM i 



mm 



370 496 622 748 874 1 000 1 1 05 1 1 89 

Lanczos sweeps {= Matrix-vec ops x 2) 



Small and large eigenvalues of g5M Lanczos iterations 



small eigs of g5M 
large eigs of g5M 



245 371 497 699 825 951 1 1 53 1279 1 481 1 683 1 809 1 935 2061 21 87 2368 
Lanczos sweeps (= Matrix-vec ops) 



Figure 3: The number of large and small magnitude eigenvalues of H = 75 M 
found as a function of the number of Lanczos steps (dimension of the Krylov 
space). The horizontal axes are scaled to same number of 75-D applications. 



The fact that Figure ^| shows many more large eigenvalues being resolved 
than small ones gives good motivation for our earlier assertion that we should 
only look for eigenvalues within the region of interest. If we were to find, 
construct, and reorthogonalize with respect to all converged eigenvalues at a 
given Lanczos step most of the time would be spent preserving the orthogo- 
nality of regions we are not interested in. 

As stated earlier, we are interested in the eigenvalues which are close to a 
gap in the eigenspectrum around zero. The convergence rates for extremal 
eigevalues, i.e., those at either end of the spectrum, are well understood 
following the work of Kaniel [[K|, Paige [11] and Saad ||12|| . This explains why, 
in the case of D^D where all the eigenvalues are positive, we see the largest 
and smallest eigenvalues converge quickly compared with interior ones. In 
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the case of the matrix 75 D we see the eigenvalues smallest in magnitude 
converging quickly. These eigenvalues are not at the extremes of the spectrum 
but are close to a relatively large void in the spectrum around zero. The 
convergence rates for such "interior" eigenvalues is explained in || where we 
consider the Kaniel-Paige-Saad bounds applied to the shifted and squared 
matrix (in this case the optimal shift is zero). Figure ^ shows a comparison 
of our theoretical bounds with the errors found when finding the eigenvalues 
close to a gap in the eigenspectrum of the Fermion matrix. 




1,000 2,000 3,000 4,000 5,000 6,000 7,000 8,000 9,000 10,000 

n 



Figure 4: Graph showing the error in eigenvalue estimates as a function of 
iteration number n (Krylov subspace dimension). We compare this with the 
theoretical bounds indicated by the dashed lines. The error is determined 
by taking the absolute value of the difference between the measured Ritz 
values and the nearest eigenvalue (approximated by the most accurate Ritz 
value we obtain at the end of the run). The eigenvalue used for a given Ritz 
value is indicated by different symbols as indicated in the legend. When the 
error is large this association is somewhat arbitrary but it is unambiguous 
for the range of errors shown in this graph. The lines correspond to the 
bounds obtained using the results of , again using the spectrum as approx- 
imated using the most accurate Ritz values. We see that the purple squares 
(A = —1.066322) and green triangles (A = 1.066259) seem to correspond to 
two orthogonal eigenvectors belonging to degenerate (or very nearly degen- 
erate) eigenvalues. If they were actually degenerate then the second eigen- 
vector would be a fortuitous consequence of rounding error. The agreement 
between the observed rate of convergence and the theoretical bounds is quite 
satisfactory. 
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6. Implementation details 



All results obtained here have been obtained using the Chroma || package 
running on 4,096 cores of the UK National Supercomputing Service HEC- 
ToR |TB|, after a prototype code was initially implemented in Maple flU 



The Chroma implementation consists of highly optimized parallel linear al- 
gebra routines specifically written for lattice QCD, thus we can assume that 
matrix-vector products, inner-products and general manipulation of vectors 
and matrices are already optimized. Here we seek to minimize the number of 
calls to these operations but not to optimize them further. However, we do 
give some consideration here to patterns of access to large vectors stored in 
memory, particularly when constructing eigenvectors, and we also consider 
some optimization of the currently serial diagonalization of the tridiagonal 
matrix H using the QR method. 



6.1. Constructing eigenvectors 

Following each application of the QR method, we need to calculate the vec- 
tors yi = Qsj, where are the columns of S, i.e., the Ritz vectors. This 
means that each good Ritz vector is constructed as a linear combination 
of Lanczos vectors. The most straightforward method for constructing each 
eigenvector is via a simple loop as follows 



DO i = 1 to # good Ritz vectors 
DO j = 1 to # Lanczos vectors 

y[i] = y[i] + q[j] * S[j,i] 
END DO 

END DO 



where the number of good Ritz vectors is expected to be much smaller than 
the number of Lanczos vectors. 

However, this may not be the most efficient ordering. After many Lanczos 
iterations we will have a large number of Lanczos vectors and they may not 
all be available in fast memory. We therefore need to ensure that once a 
Lanczos vector is retrieved from memory we make the most efficient use of 
it, reducing the need for multiple loads and stores of the vector to and from 
memory. It may even be that we cannot store all of the Lanczos vectors, and 
need to reconstruct them on the fly. It therefore makes sense to access (or 
reconstruct) each Lanczos vector in turn and build up the good Ritz vectors 
together, by interchanging the order of the loops 
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DO j = 1 to # Lanczos vectors 

Recalculate/access q[j] 

DO i = 1 to # good Ritz vectors 
y[i] = y[i] + q[j] * S[j,i] 

END DO 
END DO 



In both cases the Ritz vectors are accessed and updated within the inner 
loop but the second method should result in fewer accesses to the Lanczos 
vectors, q^. Experiments show an average speed-up of approximately 50% in 
this case. 

There are some further interesting architecture-dependent trade-offs that 
could be investigated. Depending on the amount of memory available and 
the memory bandwidth we can choose between 

1. Storing the Lanczos vectors in main memory (DRAM); 

2. Storing the Lanczos vector in secondary storage (disk or Flash RAM); 

3. Recomputing the Lanczos vectors at each pause. This minimizes off- 
chip data transfer, and is "embarrassingly parallel" up to a few global 
sum operations (for inner products and norms). 

A full investigation of these options has not been performed here. 



6.2. Diagonalization of H : QR 

We need to pause the Lanczos process periodically to determine the eigen- 
spectrum of the tridiagonal matrix H. This can be achieved efficiently using 
the iterative implicit QR algorithm || with suitable shifts. 

Many implementations of implicit QR methods exist. The results here were 
obtained using Lapack |T3] routines built on top of BLAS accelerated 



using the ACML library |I7| . The DSTEV Lapack routine could be used to 



determine all the eigenvalues, and optionally all eigenvectors, of a symmetric 
tridiagonal matrix. This works well for our needs; however, we are only 
interested in eigenvalues from within a region E, which can give a significant 
performance benefit. We are better off employing a routine such as DSTEVX 
which finds eigenvalues only within a specified interval. In the case where E 
is a non- contiguous range, this may result in the routine being called several 
times, once for each range, or the algorithm could be rewritten to work with a 
disjoint range. We could also make use of previously known good eigenvalues 
as shifts, but this has not been implemented. 
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7. Results 




Figure 5: Breakdown of time spent in various parts of the new algorithm 
versus Lanczos iteration for 75 D on a 24 3 x 48 lattice with 12 degrees of 
freedom per lattice site (i.e., N = 7, 962, 624) on 4,096 cores of a Cray XT4. 
"Constructing Ritz vectors" means computing y = Qs, and "Purging good 
eigenvectors" means reorthogonalising the last two Lanczos vectors with all 
known good Ritz vectors. The x-axis shows the iteration numbers at which 
the algorithm is paused. The frequency of pauses is such that the x-axis scale 
is approximately linear. 



Figure |5| shows a breakdown of the various components of the algorithm 
when running on the largest processor count attempted (4,096) for our new 
algorithm applied to 75 D. We find that with our implementation the most 
expensive operation is the creation of the eigenvectors of 75 D followed by 
the application of the QR method, which is why we wish to create as few 
eigenvectors as possible. It may also be desirable to implement a faster (e.g., 
parallel) QR method as the number of eigenvalues required becomes larger. 

Figure [5] shows that the speed-up of the creation of eigenvalues with processor 
count is super-linear. This is due to the fact that with increasing processor 
count the number of eigenvectors which can be held in cache on each processor 
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increases as the local sub-vectors become smaller. The net result is a super- 
linear speed-up of the entire algorithm with processor count. 



Speed-up by component of code 
45 i 




512 1,024 1,536 2,048 2,560 3,072 3,584 4,096 



Processor count 

Figure 6: Parallel speed up of new algorithm and its components 

In order to illustrate the efficiency of our implementation of our new vari- 
ant of the LANSO algorithm to find low-lying eigenmodes of the fermion 
matrix we compare it with the current state-of-the-art, the Chroma im- 
plementation of the Kalkreuter-Simma algorithm described in Q. This 
method uses a conjugate gradient (CG) method to minimize the Ritz func- 
tional fi(z) = (z, Az) / 1| z || 2 with A = (75D) 2 , where z is deflated with respect 
to all previously computed eigenvectors. The CG minimization alternates 
with a diagonalization of 75 D on the subspace of computed eigenvectors to 
separate eigenvalues of 75 D of different sign but the same magnitude, taking 
into account that we may not know the full degenerate subspaces. 

Comparing like- wit h-like for the various methods of determining eigenpairs is 
not completely straightforward as one has to consider some kind of tolerance 
within which the eigenvalues are determined. In the case of the Kalkreuter- 
Simma algorithm convergence is specified by stopping criteria on the CG 
method, whereas in our new algorithm we determine whether a Ritz pair 
(8, y) has converged by looking at the bottom component of the Ritz vector. 
Moreover, we continue to refine the eigenpairs at each pause, so their accuracy 
improves: we could deflate with respect to sufficiently good eigenvectors but 
we have not studied this option. 
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We compare the results studying the norm of the residual vector ||(A — 0)y\\. 
We adjust the relevant stopping criteria and tolerances until we see similar 
magnitudes of this norm and then compare the result in terms of the overall 
computation time: the results are in Figure [F[ 



Performance of New Algorithm 
and Ritz Functional Method 

10,000 t 




1 1 

10 20 30 40 50 60 70 

Number of eigenvalues 

Figure 7: Comparison of our algorithm and the Ritz functional method M, [5[] 
implemented in Chroma and run on HECToR as a function of the number 
of small magnitude eigenpairs of 75 D found. Results are shown for different 
residual values for the Ritz method; the corresponding errors for our method 
are always smaller than the best Ritz functional estimates, and decrease as 
the Krylov space grows. 



8. Conclusions 

We have introduced a new algorithm to determine the eigenpairs of large 
Hermitian matrices based on the LANSO method of Parlett and Scott, and 
implemented and tested it on a realistic large-scale computation in lattice 
QCD. Our algorithm differs in two ways from LANSO: it only determines 
eigenpairs within a specified region of the spectrum, as this is all that is 
needed, and it uses a new a bound to trigger "pauses" at which Ritz pairs are 
computed and selective reorthogonalization performed. We found that this 
reduces the number of such pauses significantly, and moreover far less work 
is required as we only need to construct the eigenvectors we are interested in. 
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Our method compares very favourably with the methods that are currently 
in use, and promises to be useful for other problems such as "low-mode 
averaging" in QCD calculations as well as in applications in other areas. We 
have indicated several possible improvements that could be studied in future. 
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